Evaluation of a Dutch stemming algorithm
Abstract
In state-of-the-art Information Retrieval (IR) systems, the most salient problem is to improve recall while retaining high precision. A simple recall-enhancing technique, useful even for the simplest boolean retrieval systems, is stemming. An information seeker who is looking for texts about, for example, dogs is probably also interested in a text which contains the word dog. An algorithm which maps different morphological variants to their base form (stem) is called a stemming algorithm. The underlying assumption for fruitful use of such a stemmer is that morphological variants of words are semantically related. This is obviously not always true. In information retrieval, the use of stemming is controversial (cf. Harman 1991). However, several authors (Frakes and Baeza-Yates 1992; Krovetz 1993; Popovič and Willett 1992) report favourable results. The UPLIFT project investigates whether linguistic tools can improve the performance of an IR system for Dutch texts. As a first step we will adapt and test two common stemming techniques for Dutch text. The first technique, quite popular in several experimental and commercial IR systems, is suffix stripping. Suffix stripping is a pragmatic approach. The algorithms are small and efficient and are not hampered by linguistic claims. Efficiency is an important property of every subpart of an IR system, especially for modern interactive systems. However, the simple architecture of such algorithms has drawbacks, such as the easy introduction of errors. The second technique, stemming based on morphological analysis, requires more complex resources. This approach tries to exploit linguistic knowledge about the internal structure of word forms. A necessary component for such a morphological analysis is a dictionary. In general, each word which has to be stemmed will involve a dictionary lookup, and therefore this technique will be considerably slower than suffix stripping.
On the other hand, such careful
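The contrast the abstract draws between suffix stripping and dictionary-based stemming can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the suffix list, the minimum stem length, the toy lexicon, and the function names `strip_suffix` and `dictionary_stem` are hypothetical and are not the rules or resources evaluated in the paper.

```python
# Illustrative suffix list only; NOT the rule set evaluated in the paper.
SUFFIXES = ("en", "s", "e")
MIN_STEM_LENGTH = 3  # hypothetical guard against stripping very short words


def strip_suffix(word: str) -> str:
    """Remove the first matching suffix (longest first) if enough of the word remains."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


# Hypothetical toy lexicon mapping word forms to stems.
LEXICON = {
    "hond": "hond",
    "honden": "hond",    # "dogs" -> "dog"
    "boeken": "boek",    # "books" -> "book"
    "regen": "regen",    # "rain" is already a base form
}


def dictionary_stem(word: str) -> str:
    """Look the word up in the lexicon; leave unknown words unchanged."""
    return LEXICON.get(word, word)


if __name__ == "__main__":
    for form in ("honden", "boeken", "regen"):
        print(form, strip_suffix(form), dictionary_stem(form))
```

In this toy setting both approaches map "honden" to "hond", but the suffix stripper also truncates "regen" (rain) to "reg", the kind of easily introduced error the abstract mentions. The lexicon-based version avoids it at the cost of one lookup per word and the need for a dictionary that covers the vocabulary.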
Similar resources
Porter’s stemming algorithm for Dutch
A stemming algorithm provides a simple means to enhance Recall in Text Retrieval systems. The paper describes the development of a Dutch version of the Porter stemming algorithm. The stemmer was evaluated using a method inspired by Paice (Paice, 1994). The evaluation method is based on a list of groups of morphologically related words. Ideally, each group must be stemmed to the same root. The r...
Accurate Stemming of Dutch for Text Classification
This paper investigates the use of stemming for classification of Dutch (email) texts. We introduce a stemmer, which combines dictionary lookup (implemented efficiently as a finite state automaton) with a rule-based backup strategy and show that it outperforms the Dutch Porter stemmer in terms of accuracy, while not being substantially slower. For text classification, the most important propert...
Integrating Linguistic Knowledge in Passage Retrieval for Question Answering
In this paper we investigate the use of linguistic knowledge in passage retrieval as part of an open-domain question answering system. We use annotation produced by a deep syntactic dependency parser for Dutch, Alpino, to extract various kinds of linguistic features and syntactic units to be included in a multi-layer index. Similar annotation is produced for natural language questions to be ans...
Improving Precision in Information Retrieval for Swedish using Stemming
We will in this paper present an evaluation of how much stemming improves precision in information retrieval for Swedish texts. To perform this, we built an information retrieval tool with optional stemming and created a tagged corpus in Swedish. We know that stemming in information retrieval for English, Dutch and Slovenian gives better precision the more inflecting the language is, but precis...
Publication date: 1995